Abstract

Principals and teachers do not use large-scale assessment results because the lack of
distinct and reliable subtests prevents identifying strengths and weaknesses of students and
instruction, the results arrive too late to be used, and principals and teachers need
assistance to use the results to improve instruction so as to improve student learning.
Therefore, it is recommended that the first assessment activity should be to clearly establish
that the domain to be assessed is multidimensional. Given this, the assessment schedule
should be changed so that a given subject area is assessed in non-consecutive years but
the number of sittings remains the same each year. Assistance should be provided to
principals and teachers so as to increase their understanding of how to use large-scale
assessment results. Three suggested assessment cycles are presented, each of which increases
the reliability of subtests and provides principals and teachers with at least two years to
make changes in instruction.
Résumé

Les directeurs et enseignants n’utilisent pas les résultats d’évaluations à grande échelle
car le manque de sous-tests fiables et distincts empêche l’identification des forces et
faiblesses des étudiants et de l’instruction, les résultats arrivent trop tard pour être utilisés
et les directeurs ainsi que les enseignants ont besoin d’aide pour exploiter les résultats
afin d’améliorer l’instruction, de manière à améliorer l’apprentissage des élèves. Il est
donc recommandé que la première activité d’évaluation consiste à établir clairement que
le domaine à évaluer est multidimensionnel. Ceci étant établi, le programme d’évaluation
devrait être modifié de manière qu’un domaine donné soit évalué lors d’années non
consécutives, mais que le nombre de séances demeure identique chaque année. Une aide
doit être apportée aux directeurs et enseignants afin d’accroître leur compréhension de
la façon d’utiliser les résultats d’évaluations à grande échelle. Trois cycles d’évaluation
suggérés sont présentés, chacun augmentant la fiabilité des sous-tests et donnant aux
directeurs et enseignants au moins deux ans pour apporter des modifications aux
méthodes d’instruction.

Mots-clés : évaluations à grande échelle, sous-tests, problèmes, solutions,
multidimensionnel


This article was first presented as a Presidential Address delivered to members of the Canadian Educational
Researchers’ Association (CERA) during the annual conference of the Canadian Society for the Study of
Education (CSSE), in Victoria, June 3, 2013.


Cet article fut présenté pour la première fois sous la forme d’un message du président aux membres de
l’Association canadienne de chercheurs en éducation (ACCE) lors de la conférence annuelle de la Société
canadienne pour l’étude de l’éducation (SCEE), à Victoria, C.-B., le 3 juin 2013.
Introduction

Government policy makers and officials in Canada believe public evidence of student
performance drawn from sound and credible large-scale assessments will help focus
educators’ attention on improving curriculum and instruction and, as a result, enhance
student learning and performance. While many agree that data-informed decision making
is a way to improve curriculum and instruction, and thereby student learning, others
question the value of large-scale assessments for that purpose. The purpose of
this article is to briefly review the beneficial and detrimental effects of large-scale testing,
discuss three existing concerns, and propose a change in the scheduling of large-scale
assessments to address the three concerns.


Beneficial Effects of Large-Scale Assessments

Briefly, a common assessment provides a common “yardstick” that allows students across
schools to be treated equitably and fairly (Phelps, 2008; O’Conner, 2009).
Large-scale assessment results have led to increased attention
to students with special needs (Roderick & Engel, 2001; Thurlow & Ysseldyke, 2001).
Providing assessment results to students, be they from classroom assessments or large-scale
assessments, has a strong positive effect on student achievement (Phelps, 2012). After
being involved in item writing, item review panels, sensitivity review panels, and scoring
students’ responses to open-ended items, principals and teachers return to their schools
and classrooms with enhanced training and experience in item writing and scoring that
they can apply to their own classroom assessments (Cizek, 2001). Further, they can use
large-scale assessment results to identify the need for professional development
presentations and workshops (Cizek, 2001). Members of departments of education and school
district personnel can use large-scale assessment results to confirm that the curriculum has
been addressed effectively (Lissitz & Schafer, 2002). The presence of publicly reported
large-scale assessment results can serve as a conversation starter for a discussion about
what should constitute an accountability system, the implementation of which is vital
if the goal is to improve education and achievement of students (Cizek, 2001; Ferrera,
2005; Mirazchiyski, 2013; Paton, 2013).
Detrimental Effects of Large-Scale Assessments

Among the points of concern are that large-scale assessments
narrow instructional content with a concomitant emphasis on students learning
lower order thinking skills at the expense of higher order thinking skills; reduce
instructional time in favour of test preparation activities; yield results for students who are no
longer the teachers’ students since the students move to the next grade or to a junior or
senior high school; increase cheating; and reduce teacher professionalism (Brandt, 1995;
Burrows, Groce, & Webeck, 2005; Chester, 2005a, 2005b; Darling-Hammond, Ancess,
& Falk, 1994; Earl & Katz, 2006; Kohn, 2000; National Council on Measurement in
Education, 2012; Popham, 2002; Shepard, 1991, 2010; Wiggins & McTighe, 2005).

These concerns continue to be expressed and, with the introduction of programs like No
Child Left Behind (United States Department of Education, 2003) and Race to the Top
(United States Department of Education, 2009), they now include restrictions on teacher
authority; questionable evaluations of school personnel and teacher stress; unwarranted
reductions of teacher salaries; school sanctions; neglect of content not covered by the
assessments (e.g., if science is not assessed, then perhaps science is not all that important
to learn); inconsistent performance standards and cut-scores across grades and over
years; and inconsistent school results over time (e.g., Berliner, Popham, & Shepard, 2000;
Burrows et al., 2005; Chester, 2005a; Childs & Fung, 2009; Cizek, 2001; Kane & Staiger,
2002; Klinger & Rogers, 2011; Klinger, Shula, & Wade-Wooley, 2009; Linn, 2003;
Thompson, 2001). Ravitch (2010) compellingly summarizes these concerns, concluding
that curriculum and instruction are far more important than large-scale assessments and
that testing has become an end in itself and not a means to an end.


The Situation in Canada 

The majority, but not all, of the research cited above was conducted in the United States.
While a similar situation exists in Canada, there are two important differences between
the provinces/territories in Canada and all but a few states (e.g., Hawaii, Texas) in the
United States. First, the curriculum for a subject area in each province/territory is
common to all schools in the province/territory. Teachers in all schools in the province/territory must
provide learning opportunities to their students to enable them to learn the knowledge
and acquire the thinking, problem solving, and reasoning skills identified in the learning
expectations provided in the program of studies or curriculum guide. Consequently,
large-scale provincial/territorial assessments can be used to improve instruction but not the
curriculum at the school level. Second, the sanctions that exist in the United States are not
present to the same degree in Canada. While some teachers may elect not to teach a grade
that has a provincial/territorial assessment, accountability in Canada is commonly framed
within the context of professional responsibility, with the expectation that principals and
teachers will use the results of large-scale assessments to inform and support their own
ongoing school improvement efforts to improve student learning and performance.

However, the contention in this article is that large-scale assessment results 
in Canada and elsewhere will become more useful to the extent that they provide the 
following: 

a. relevant and justifiable evidence to foster a conversation about how to 
improve instruction of all students; 

b. adequate time for principals and teachers so that they can meaningfully 
engage in sound and valid planning and implementation of needed 
instructional changes; and 

c. assistance to teachers and principals to help them use the information from 
large-scale assessment and to integrate the information with information 
gained from their own classroom assessments. 

It is argued in this article that these three points can be effectively addressed by
providing credible diagnostic information and more time for and assistance to principals
and teachers to allow them to use this information to improve instruction in ways
that enhance student learning and achievement.


Three Issues That Need To Be Addressed 

Lack of Credible Diagnostic Information 

To start a conversation about how to improve instruction for all students, principals and
teachers need reliable “diagnostic” information that they can validly interpret in terms of
strengths and weaknesses of their students. They clearly recognize that subtest scores will
allow them to see what changes are needed to improve what is taught and how it is taught
so as to improve their students’ learning (Hattie & Timperley, 2007). As indicated above,
providing feedback to students that they can use to identify their own strengths and
weaknesses leads to improved student learning and performance (Phelps, 2012).

Reporting subtest scores from curriculum-based assessments rests on the
assumption that the curriculum is multidimensional. For example, most
mathematics curricula are divided into five or so content subdomains such as number
sense and numeration, measurement, geometry and spatial sense, patterning and algebra,
and data management and probability. As well, most curricula include a cognitive
component such as knowledge and application, problem solving, reasoning, and evaluation.
Clearly, the names convey differences among the content subdomains and the cognitive
levels. But are the content and cognitive subdomains actually distinct?

To answer this question, the first activity in the test development process should
be to clearly establish domain clarity. Deliberate effort needs to be devoted at the very
beginning of the assessment process to determine if the domain to be assessed is
unidimensional or multidimensional. If the domain is found to be unidimensional, then there
is no warrant to report subtest scores. If the domain is found to be multidimensional,
then there is a warrant to report subtest scores. However, what most frequently happens
is that an implicit assumption is made that the curriculum is multidimensional
(Haladyna & Kramer, 2004). Items are carefully developed for each subdomain such that each
item is relevant to the subdomain and the set of relevant items represents the subdomain
(Messick, 1989). But the total number of items for each subdomain is limited so that the
full test can be administered in two to three hours, given the age of the students to be
assessed.

Despite the small number of subtest items, the question “Can we report
subscores?” often arises after the assessment has been administered. At this point, methods
for empirically determining if subtest scores are distinct or if they add value over the total
score are used.¹ Generally, the findings reveal that reporting subtest scores for current
large-scale assessments is not warranted. The common reasons are high subtest
correlations and low subtest reliabilities. Greg Cizek, in his National Council on Measurement
in Education Presidential Address (April 29, 2013), firmly stated that reporting subtest
scores from current assessment instruments and tests was simply inappropriate.

What is wanted by principals and teachers is displayed in Figure 1. The five
subdomains have been determined to have a good amount of uniqueness by a panel of
subject matter experts who know well the knowledge and skills of the domain and its
subdomains to be learned and the characteristics of the students to be taught. Further, the
subtests developed to measure each of the subdomains are composed of relevant and
representative items. First consider Students A and B. The performance of Student A across
the five parts of the curriculum is uniformly low and the performance of Student B across
the five parts of the curriculum is uniformly high. Students like Students A and B learn
the knowledge and skills of each subdomain equally well, but at different levels of
performance. Students like Student C do not learn the knowledge and skills of each subdomain
equally well. Teachers can use the profiles of students like Student A to develop a full
remedial plan to help erase their general low performance and a remedial plan for students
like Student C to address the subdomains with low performance while maintaining their
performance for the subdomains with high performance. The point to make here is that
the profiles displayed in Figure 1 can only be obtained if it is clearly and well established
that the domain is multidimensional to at least some degree before items are developed.
If the domain is not multidimensional, then profiles like that shown for Student C will not
be realized.


¹ Among the methods used to determine if subtest scores are distinct are (a) correlations corrected for attenuation
(Haladyna & Kramer, 2004; McPeek, Altman, Wallmark, & Wingersky, 1976), (b) agreement method (Babenko
& Rogers, 2014; Kelly, 1923; Gulliksen, 1950; Lord & Novick, 1968), and (c) generalizability analysis (Rogers &
Radwan, 2012). The proportional reduction in mean square error (Haberman, 2005; Sinharay, Haberman, & Puhan,
2007; Sinharay, Puhan, & Haberman, 2009; Sinharay, 2010) can be used to determine if a subtest score adds value
over the total test scores.
Figure 1: Sample Student Profiles for Five Parts of the Curriculum

Now, what has been said in the previous paragraph is predicated on the
assumption that the subtests are reliable. But the reliabilities of subtests developed to measure
subdomain performance are generally too low (Haladyna & Kramer, 2004; Luecht,
Brumfield, & Breithaupt, 2006). Time limits for the administration of the total
assessment instrument are set according to the age of the students and rarely exceed two hours,
with perhaps an extra half hour as an accommodation for students who need more time.
Consequently, the number of items for the subtests included in an assessment instrument
is small, which in turn limits the reliability of the subtests. A sufficient number of items
needs to be included in each subtest so that the reliability of each subtest is adequate to
report subtest scores. In all of the cases in which investigations have been conducted to
see if subscore reporting is warranted, the main emphasis has been on developing a set
of items for the full assessment that represents the full domain according to the table of
specifications for the full assessment and with a two-hour test administration limit.

Therefore, to warrant reporting of profiles of subtest scores, consideration needs
to be given to three conditions when constructing assessment instruments. The first, a
substantive condition, is that the construct or domain to be assessed is multidimensional.
The second and third conditions are statistical in nature, namely, that there are low
correlations among subtests and high subtest reliabilities.
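
The two statistical conditions interact: an observed correlation between two subtests is pulled downward by unreliability, so the correlation corrected for attenuation is often inspected alongside the raw value. The following is a minimal sketch only; the function name and all numeric values are hypothetical and are not taken from any assessment discussed in this article.

```python
import math

def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Correlation between two subtests corrected for attenuation.

    r_xy  : observed correlation between the two subtest scores
    rel_x : reliability of the first subtest
    rel_y : reliability of the second subtest
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values chosen only for illustration.
observed_r = 0.60        # observed correlation between two subtests
rel_geometry = 0.65      # assumed reliability of a geometry subtest
rel_measurement = 0.70   # assumed reliability of a measurement subtest

corrected = disattenuated_correlation(observed_r, rel_geometry, rel_measurement)
print(f"corrected correlation: {corrected:.2f}")
```

A corrected value approaching 1.0 would suggest that the two subtests are measuring essentially the same thing, in which case separate subtest scores would add little.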

Lack of Adequate Time to Profitably Use the Results 

Large-scale assessments are generally administered toward the end of the school year 
or, for semestered schools, the end of the semester. The staff of test agencies responsible 
for scoring students’ responses to open response items, analyzing the scored responses, 
equating one assessment to another within and/or across years, and preparing reports 
work diligently during July and the first half of August to get results to schools before the 
beginning of the next school year. 

Despite this effort, can principals and teachers do what they are supposed to do
before school starts? They need to do each of the following before classes begin:

• interpret the results from the large-scale assessments; 

• integrate what they have learned with what they know from their own 
classroom assessments to identify strengths and weaknesses in student 
learning; 

• review what they did during the last year to allow the students to acquire the 
knowledge and skills their students were expected to learn; and 

• identify and make sound and credible changes to their teaching materials, 
activities, and instructional approach that will enhance the learning and 
achievement of the students they have for the coming year. 

Typically, teachers are expected back to school during the week before school 
starts or on the first day of school. Given the startup activities principals and teachers are 
responsible for at the beginning of the school year, they have little or no available time 
to use large-scale assessment results in a meaningful way and do what they need to do 
to improve instruction before the school year begins. Consequently, they often simply 
ignore the large-scale assessment results (Klinger & Rogers, 2011). 

Lack of Needed Assistance 

Principals and teachers need assistance with using large-scale assessment results in
a meaningful way. They need assistance with interpreting the assessment results for
their school, integrating the results with what they know from their own classroom
assessments, and using the combined results to identify strengths and weaknesses
in the teaching materials, activities, and instructional approaches they use at the class
level (Deluca & Klinger, 2010; Webber, Aitken, Lupart, & Scott, 2009). Teachers’ lack of
understanding of the purposes and uses of large-scale assessments has a negative impact
on their use of and attitude toward large-scale assessments (Aiken, 1991; Bracey, 2005;
Burger & Krueger, 2003; Cannell, 1987; Earl, 2003; Fairhurst, 1993; Lewis, 2007; Smith,
1991).


Proposed Solutions 

Ensure Relevant and Trustworthy Diagnostic Evidence 

In order to find students with profiles like Student C in Figure 1, it is necessary that the
curriculum be truly multidimensional. This requires that the parts of the curriculum are
distinct and not highly correlated and the subtests developed to assess each part are
composed of relevant and representative items, have adequate reliability, and do not
correlate highly. Thus, the first step to take when developing an assessment is to carefully and
deliberately consider whether or not the different subdomains of the curriculum are
distinct. It may well be that the subdomains are related, but there must be some uniqueness
for each subdomain to warrant the claim of distinction. Members of a panel of experts
in the subject area who are knowledgeable about the students to be assessed should
independently determine what makes each subdomain unique and then reach consensus.
The uniqueness might include content and/or thinking skills that are needed to learn the
content and acquire the skills for each subdomain.

If the subdomains of the curriculum are judged to be at least partially distinct and
items relevant to and representative of each subdomain have been constructed, then
responses from a representative sample of students gathered from a pilot study or field trial
can be analyzed to provide empirical evidence that the domain is multidimensional. The
responses of the students to each item in a subtest should correlate as follows:

a. highly with the subtest score (i.e., high item discrimination within subtest); and 

b. lowly with the subtest scores from the other subtests (i.e., low item discrimination 
across subtests other than the subtest the item belongs to). 
Given that a. and b. are met, then

c. the correlations among the subtest scores should be lower than the reliabilities of 
the subtests to be correlated. 
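
Conditions a. to c. can be checked directly on field-trial data. The sketch below is illustrative only: the response matrix and the item-to-subtest mapping are invented, coefficient alpha is assumed as the reliability estimate (the article does not prescribe one), and each item is left in its own subtest score, which inflates the within-subtest correlation for short subtests.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def variance(x):
    """Sample variance (n - 1 in the denominator)."""
    mean_x = sum(x) / len(x)
    return sum((a - mean_x) ** 2 for a in x) / (len(x) - 1)

def cronbach_alpha(item_columns):
    """Coefficient alpha for a subtest, given one list of scores per item."""
    k = len(item_columns)
    totals = [sum(values) for values in zip(*item_columns)]
    item_variance_sum = sum(variance(col) for col in item_columns)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical field-trial data: one row per student, one 0/1 score per item.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
]
# Hypothetical mapping of subtest name to item (column) positions.
subtests = {"number": [0, 1, 2], "geometry": [3, 4, 5]}

def column(i):
    """Scores of all students on item i."""
    return [row[i] for row in responses]

# Subtest score for each student (sum of the items in that subtest).
scores = {name: [sum(row[i] for i in ids) for row in responses]
          for name, ids in subtests.items()}

# Conditions a and b: each item should correlate more highly with its own
# subtest score than with the score of any other subtest.
item_report = []
for name, ids in subtests.items():
    for i in ids:
        own = pearson(column(i), scores[name])
        other = max(pearson(column(i), scores[o]) for o in subtests if o != name)
        item_report.append((i, name, own, other))
        print(f"item {i} ({name}): own r = {own:.2f}, highest other r = {other:.2f}")

# Condition c: the correlation between subtests should be lower than the
# reliabilities of the subtests being correlated.
alphas = {name: cronbach_alpha([column(i) for i in ids])
          for name, ids in subtests.items()}
r_between = pearson(scores["number"], scores["geometry"])
condition_c = r_between < min(alphas.values())
print(f"r between subtests = {r_between:.2f}, alphas = "
      f"{ {n: round(a, 2) for n, a in alphas.items()} }, condition c met: {condition_c}")
```

With real data, a corrected item-subtest correlation (item removed from its own subtest score) and a much larger sample would be preferable; this sketch keeps the simpler form described in the text.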

Change the Assessment Schedule 

The present schedule of annually assessing students in the same subject areas should be 
changed to allow sufficient time for principals and teachers to use subtest scores to plan 
what changes in instruction may be needed and to implement the changes in a reasonable 
and steady manner. To achieve this, the assessment schedule should be changed so that a 
given subject area is not assessed every year but the total number of times a student sits 
for the assessments remains the same or essentially the same—three to four sittings of 2 
hours each. Adoption of this proposal will 

1. allow principals and teachers more time to integrate the large-scale assessment
results with their own classroom assessment results, identify areas of strength and 
weakness, review what they did the previous year, and formulate and implement 
changes to their teaching materials, activities, and instructional approach to 
address weaknesses while maintaining strengths, 

and, at the same time 

2. allow a greater number of items for each subtest in order to ensure high subtest
reliability. 

The three possible options that follow are provided to illustrate possible administration 
schedules. 

One of three assessments per year. If three subject areas (e.g., either mathematics,
reading, and writing, or literacy, mathematics, and science) are assessed each year, then the
change would lead to one subject area being assessed each year in three sittings. Three
years would be needed to complete a cycle as shown in Figure 2. Three sittings per year
would triple the number of subtest items, thereby leading to an increase in subtest
reliability. Principals and teachers would have a greater period of time between two consecutive
assessments of the same subject area to integrate the large-scale assessment results with
their own classroom assessment results, identify strengths and weaknesses, and
formulate changes to be made to the teaching materials, activities, and instructional approach,
perhaps during the first year. During the second year, they could try out the changes and
make revisions as needed, and then implement the revisions in the third year.

In the fourth year, the first subject area would be assessed again. Using the results 
from this second assessment, principals and teachers could be properly held accountable 
for the changes to the teaching materials, teaching activities, and/or instruction they had 
made in an attempt to improve student performance in their school between Year 1 and 
Year 4. 
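
The one-subject-per-year rotation can be written out as a simple cycle. This is a sketch under stated assumptions: the function name is hypothetical and the subject order is chosen only for illustration.

```python
def rotation_schedule(subjects, years):
    """Return {year: subject} when one subject area is assessed per year."""
    return {year: subjects[(year - 1) % len(subjects)]
            for year in range(1, years + 1)}

# Subject order assumed for illustration only.
subjects = ["Reading", "Writing", "Mathematics"]
schedule = rotation_schedule(subjects, 4)

for year, subject in schedule.items():
    print(f"Yr {year}: {subject}")

# Number of years between two assessments of the same subject area.
years_between = len(subjects)
```

Under this rotation the subject assessed in Year 1 returns in Year 4, which is the point at which the text proposes that the effect of instructional changes be examined.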


[Figure 2 diagram: a repeating cycle with Yr 1 Reading, Yr 2 Writing, Yr 3 Mathematics, Yr 4 Reading, and so on.]

Figure 2: Assessment of One Subject Area Each Year 

Two of three assessments one year and one assessment the next year. Some may argue that
three years between assessments of a subject area is too long. To shorten the time
between assessments of a subject area, two subject areas would be
assessed in two or three sittings for each assessment in the first year and one subject area
would be assessed in two or three sittings in the second year (see Figure 3). The
advantage of the second option is that each subject area would be assessed every two years
instead of every three years as in the first option (cf. Figures 2 and 3). The planning and
implementation stage would be shortened to two years: planning and pilot testing during
the first year and full implementation in the second year.

A potential disadvantage of this option is that the students would take two
different assessments one year and one assessment in the next year. A further disadvantage is
the number of items for each assessment would be reduced by a third if two sittings were
used for each assessment, which could result in subtest reliabilities that are too low.




[Figure 3 diagram: a repeating cycle; recoverable labels are Yr 2 Mathematics, Yr 3 Reading and Writing, and Yr 4 Mathematics.]


Figure 3: Assessment of Two Subjects in One Year and a Third Subject in the Next Year 

Two assessments each year. In some jurisdictions as many as four different subject areas
are assessed in one year. The schedule in this case would look like what is presented in
Figure 4 for literacy (reading and writing combined), mathematics, science, and social
studies. Each subject area would be assessed every two years. The number of sittings for
each subject area would be at least two. The advantage of this option is that each subject
area would be assessed every two years as in the previous option (Figure 3) instead of
every four years (Figure 2 with four assessments). As with the second option, a potential
disadvantage of the third option is the number of items for each assessment would be
reduced by a third if two sittings were used for each assessment, which could result in
subtest reliabilities that are too low.



[Figure 4 diagram: a repeating cycle; recoverable labels are Social Studies, Writing, and Mathematics.]


Figure 4: Assessment of Two Subjects Each Year 

A potential disadvantage of all three schedules is that teachers would concentrate 
more on the subject area(s) to be assessed that year and less on the non-assessed subject 
areas. This disadvantage could be addressed by ensuring that teachers follow through 
with planning and implementation of the changes identified from the previous year’s 
assessment(s). 

Increased Reliability 

At different points an increase in reliability has been mentioned. If the first option (Figure
2) were to be adopted, then the number of items would triple. If either the second (Figure
3) or third (Figure 4) option were adopted, then the number of items would at least
double. Based on different assessments conducted today,
subtest reliabilities typically range from about 0.50 to 0.80, with a median and mean
close to 0.65. As shown in Table 1, application of the Spearman-Brown formula (Lord &
Novick, 1968, p. 112) would yield a range from

1. 0.67 to 0.89 with a median and mean close to 0.79 if the number of items for
each subtest were doubled

and

2. 0.75 to 0.92 with a median and mean of 0.85 if the number of items in each
subtest were tripled.


Table 1: Subtest Reliabilities if the Number of Items Is Doubled or Tripled

Initial          Test Length
Reliability      Double     Triple
0.50             0.67       0.75
0.60             0.75       0.82
0.65             0.79       0.85
0.70             0.82       0.88
0.80             0.89       0.92
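
The entries in Table 1 follow from the Spearman-Brown formula, in which a subtest with reliability r lengthened by a factor k has projected reliability k r / (1 + (k - 1) r). A minimal sketch that recomputes the table follows; the function name is hypothetical, and the formula assumes the added items are parallel to the existing ones.

```python
def spearman_brown(reliability, k):
    """Projected reliability when test length is multiplied by k."""
    return k * reliability / (1 + (k - 1) * reliability)

initial_reliabilities = [0.50, 0.60, 0.65, 0.70, 0.80]

print("Initial  Double  Triple")
for r in initial_reliabilities:
    doubled = spearman_brown(r, 2)
    tripled = spearman_brown(r, 3)
    # Three decimals are printed so that rounding to two can be checked by eye.
    print(f"{r:.2f}     {doubled:.3f}   {tripled:.3f}")
```

The printed values agree with Table 1 once rounded to two decimal places.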


Reliabilities below 0.80 are perhaps too low, but if the recommendation given
earlier to select items for a subtest for which the student responses correlate highly with the
subtest score and lowly with the other subtest scores is accepted, then adequate reliability
would likely be obtained.

Provide Adequate Support 

Principals and teachers need assistance with interpreting and using large-scale
assessment results for their schools. Personal assistance provided by members of an assessment
agency’s outreach team throughout the year and interactive online reporting mechanisms,
where principals and teachers can compare the performance of their schools with schools
with similar bio-demographic characteristics, are two ways the needed assistance can be
continuously provided. Although teachers can use documents like the Student Evaluation
Standards (Joint Committee on Standards for Educational Evaluation, 2003) and the
Principles for Fair Student Assessment Practices for Education in Canada (Centre for
Research in Applied Measurement and Evaluation, 1993) to assist them to improve their
own assessment practices, many do not, likely because of a lack of relevant knowledge,
skills, and confidence. There clearly is a need to support principals and teachers to use
large-scale assessment results in a reasoned and effective way to improve student learning
and performance (Webber et al., 2009).


Conclusion 

Large-scale assessments have assumed the preeminent role in educational accountability
and reform because they provide a common and, ostensibly, a fairer yardstick to monitor
student achievement over time and to compare schools and school districts. They are
relatively efficient and, importantly, the results are visible (Linn, 2000). But, with one or two
exceptions (e.g., Japan), have we seen much change in student performance and are the
results used as expected, particularly at the local school level? To correct this situation,
we strongly recommend that the following be considered:

1. it be clearly confirmed that the curriculum is indeed multidimensional; 
and, if confirmed, that 

2. procedures for item analysis be expanded to include both the subtest to which an
item is referenced and the subtests to which the item is not referenced; 

3. the assessment schedule be changed to 

a. provide more time for educators to review the results to identify areas of 
strength and areas of weakness, to formulate changes to address weakness 
while maintaining strength, acquire any needed materials, and implement the 
changes in a reasonable and thoughtful way, and 

b. allow a greater number of items for each subtest, thereby increasing subtest 
reliabilities; 

and 

4. assistance be provided to school principals and teachers to help them work with 
the large-scale assessment results and the knowledge they have about their own 
instruction during the last year to make changes so as to increase student learning 
and achievement.

Principals and teachers must be provided with reliable profiles that can be validly 
interpreted, and they must have adequate time and assistance to make needed changes to 
enhance learning and achievement of all of their students. 